GAN photo editing
CMU 16-726 Image Synthesis S22
Tomas Cabezon Pedroso
In this project, we explore the possibilities of image synthesis and editing using the GAN latent space. First, we invert a pre-trained generator to find a latent variable that closely reconstructs a given real image. In the second part of the assignment, we interpolate between two images in the latent space, and we finish with image editing: we take a hand-drawn sketch and generate an image that fits the sketch, and then we use such sketches to edit a given image.
This project is based on the following two articles: Generative Visual Manipulation on the Natural Image Manifold and Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?.
Inverting the Generator
For the first part of the assignment, we solve an optimization problem to reconstruct the image from a particular latent code. Natural images lie on a low-dimensional manifold, and we consider the output manifold of a trained generator to be close to the natural image manifold. So, we can set up the following nonconvex optimization problem:
For some choice of loss $\mathcal{L}$, trained generator $G$, and a given real image $x$, we can write

$$ z^* = \arg\min_{z} \mathcal{L}(G(z), x) $$
We choose a combination of pixel and perceptual loss, as the standard Lp losses do not work well for image synthesis tasks. We also tried BCE loss, but it did not give good results. For the implementation of this part of the assignment, we reuse what we learned in assignment 4, Neural Style Transfer. As this is a nonconvex optimization problem where we can access gradients, we attempt to solve it with a first-order or quasi-Newton optimization method (in our case, LBFGS).
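As a rough illustration (not our exact code), the inversion loop can be sketched in PyTorch as follows; the generator G, the latent size, and loss_fn are placeholders:

    import torch

    def invert(G, target, loss_fn, num_steps=1000):
        # Optimize a random latent code so that G(z) reconstructs the target image.
        z = torch.randn(1, 128, requires_grad=True)   # latent size is a placeholder
        optimizer = torch.optim.LBFGS([z])

        def closure():
            optimizer.zero_grad()
            loss = loss_fn(G(z), target)
            loss.backward()
            return loss

        for _ in range(num_steps):
            optimizer.step(closure)
        return z.detach()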
Perceptual and Pixel Loss: The content loss is a metric that measures the content distance between two images at a given layer. Denote the $L$-th layer feature of the input image $X$ as $f^L_X$ and that of the target content image $C$ as $f^L_C$. The content loss is defined as the squared L2-distance of these two features:

$$ \mathcal{L}_{content}(X, C, L) = \| f^L_X - f^L_C \|_2^2 $$
To extract the features, a VGG-19 network pre-trained on ImageNet is used. The pre-trained VGG-19 consists of 5 blocks (conv1-conv5, with a total of 16 conv layers), and each block serves as a feature extractor at a different level of abstraction. As we saw in the previous assignment, the choice of layer has a big influence on the results. For this assignment, we use the conv_5 layer, as it gives the best results.
For pixel loss, we implement the L1 loss over the pixel space.
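A minimal sketch of the combined loss is shown below. Using features[:29] as a stand-in for the "conv_5" (conv5_1) activations is an assumption about torchvision's layer indexing, and the default weights mirror the values reported in the Results below:

    import torch
    import torch.nn.functional as F
    from torchvision import models

    # Frozen VGG-19 feature extractor (up to conv5_1, by assumption).
    vgg = models.vgg19(pretrained=True).features[:29].eval()
    for p in vgg.parameters():
        p.requires_grad_(False)

    def perceptual_loss(x, target):
        # Squared L2 distance between the conv_5 features of the two images.
        return F.mse_loss(vgg(x), vgg(target))

    def pixel_loss(x, target):
        # L1 distance in pixel space.
        return F.l1_loss(x, target)

    def total_loss(x, target, w_perc=0.01, w_pixel=10.0):
        return w_perc * perceptual_loss(x, target) + w_pixel * pixel_loss(x, target)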
Results
In this part of the assignment we try different combinations of VGG-19 layers, different perceptual and pixel loss weights, as well as different latent spaces (z, w and w+). We have seen that the best results are obtained using the conv_5 layer, a weight of 0.01 for the perceptual loss and 10.0 for the pixel loss. Nevertheless, we will use other weights for the following parts of the assignment. We also compare the outputs of a vanilla GAN and a StyleGAN and, as expected, the latter gives better results. We optimize the images for 1000 iterations, as more optimization time does not result in better output quality.
Interpolations
We use StyleGAN and the w+ space to embed two images into the latent space and output the images of their interpolation. In the following images and GIFs we can observe that the transitions are smooth and neat.
In the top right images we can see how the plants disappear in the embedded images; this is a good example of how StyleGANs keep the important overall features of the data they are trained on but do not learn smaller details.
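The interpolation itself is just a linear blend of the two embedded w+ codes. A hedged sketch is given below; G.synthesis is an assumption about the StyleGAN generator interface and may need adapting:

    import torch

    def interpolate(G, w_a, w_b, num_frames=40):
        # Linearly interpolate between two embedded w+ codes and synthesize each frame.
        frames = []
        for t in torch.linspace(0.0, 1.0, num_frames):
            w = (1.0 - t) * w_a + t * w_b
            frames.append(G.synthesis(w))   # G.synthesis is assumed, not our exact API
        return frames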
Scribble to Image
We can treat the scribble similarly to the input reconstruction. In this part, we have a scribble and a mask, so we can modify the latent vector to yield an image that matches the scribble. To generate an image subject to constraints, we solve a penalized nonconvex optimization problem.
We'll assume the constraints are of the form $f_i(x) = v_i$ for some scalar-valued functions $f_i$ and scalar values $v_i$. Written in a form that includes our trained generator $G$, this soft-constrained optimization problem is:

$$ z^* = \arg\min_{z} \sum_i \| f_i(G(z)) - v_i \|^2 $$
Given a user color scribble, we would like the GAN to fill in the details. Say we have a hand-drawn scribble image $S$ with a corresponding mask $M$. Then for each pixel in the mask, we can add a constraint that the corresponding pixel in the generated image must be equal to the sketch, which might look like $M_i \cdot G(z)_i = M_i \cdot S_i$ for every pixel $i$. Since our color scribble constraints are all elementwise, we can reduce the above equation under our constraints to

$$ z^* = \arg\min_{z} \| M \odot G(z) - M \odot S \|^2 $$

where $\odot$ is the Hadamard product, $M$ is the mask, and $S$ is the sketch. For the results below, we have used a perceptual loss weight of 0.05 and an L1 loss weight of 5.
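A minimal sketch of this masked objective, assuming the mask has already been broadcast to the image shape and reusing the perceptual_loss sketched earlier; the weights mirror the values reported above:

    import torch.nn.functional as F

    def scribble_loss(image, sketch, mask, perceptual_loss, w_perc=0.05, w_pixel=5.0):
        # Constrain the generated image only where the mask is non-zero (Hadamard product).
        masked_image = mask * image
        masked_sketch = mask * sketch
        pixel = F.l1_loss(masked_image, masked_sketch)
        perc = perceptual_loss(masked_image, masked_sketch)
        return w_pixel * pixel + w_perc * perc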
Image Editing
Similarly to the previous section, we will use the perceptual and pixel losses to edit an image. However, in this case we will embed the initial image in the latent space and then apply the sketch to edit it. In the following images some of the results of this image editing are shown. These images have been obtained using the conv_4 layer to calculate the loss. We can observe that some of the colors in the sketches are not present in the GAN latent space and therefore not in the output images, so we get similar colors, but not the same ones.
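A hypothetical end-to-end sketch of this editing step, assuming the scribble_loss and perceptual_loss helpers sketched earlier and a generator exposing G.synthesis: embed the original image first, then keep optimizing that same w+ code against the sketch constraint.

    import torch

    def edit_with_sketch(G, w_init, sketch, mask, perceptual_loss, num_steps=1000):
        # Start from the w+ code of the embedded original image (not random noise), so the
        # optimization stays close to the input and only pushes the masked region to the sketch.
        w = w_init.clone().detach().requires_grad_(True)
        optimizer = torch.optim.LBFGS([w])

        def closure():
            optimizer.zero_grad()
            image = G.synthesis(w)
            loss = scribble_loss(image, sketch, mask, perceptual_loss)
            loss.backward()
            return loss

        for _ in range(num_steps):
            optimizer.step(closure)
        return G.synthesis(w.detach())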
Bells & Whistles
High Resolution Grumpy Cats
We used a higher-resolution GAN to generate more detailed grumpy cats and their interpolations! Here are the results:
User interface and demo
In the following GIFs, two possible user interfaces and interactions can be seen. The user is able to draw the editing sketches, and the model optimizes the image output that best matches the initial image and the editing sketch.
What about Kim?
In the paper Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?, the authors embed different image classes in the StyleGAN latent space trained on the FFHQ dataset. They show that the model is capable of embedding images even if it was not trained on those image classes. In the paper we can see that, even though these images can be embedded in the latent space, the interpolation between them leads to images with features of the image class the model was trained on, in their case, faces. We decided to see what happens with our model when we try to embed images that are not cats. There could not be better images to embed than Kim Kardashian and the Spanish Ecce Homo. In the following images we can see that, unlike in the paper, our network is not capable of reconstructing the images. In the interpolations, as expected, we can also see the cat features that the network has learned.